Practice 2 - Fraud - EDA¶

The objective is to generate a model that, given a transaction, yields the probability that some type of fraud is involved. In addition, it is essential to analyze the data, search for profiles, generate appropriate descriptive statistics, and perform variable selection and transformation.

Dataset Dictionary¶

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import plotly.express as px
import requests
import warnings
warnings.filterwarnings('ignore')
In [3]:
# visualization options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
In [4]:
# data read
df = pd.read_csv('../data/raw/dataset_payments_fraud.csv', sep = ';')
In [5]:
df.tail()
Out[5]:
step type amount gender device connection_time nameOrig race oldbalanceOrg age newbalanceOrig zone user_number nameDest user_connections security_alert oldbalanceDest newbalanceDest isFraud
1048570 95 CASH_OUT 132557.35 woman mac 0,420243195 C1179511630 black 479803.00 39 347245.65 NaN 3179 C435674507 9 0 484329.37 616886.72 0
1048571 95 PAYMENT 9917.36 woman mac 0,37616924 C1956161225 black 90545.00 36 80627.64 africa 2558 M668364942 9 0 0.00 0.00 0
1048572 95 PAYMENT 14140.05 unknow pc 0,837108874 C2037964975 black 20545.00 31 6404.95 africa 1553 M1355182933 2 0 0.00 0.00 0
1048573 95 PAYMENT 10020.05 man mac 0,480931208 C1633237354 black 90605.00 94 80584.95 country 642 M1964992463 8 0 0.00 0.00 0
1048574 95 PAYMENT 11450.03 man pc 0,868197534 C1264356443 black 80584.95 84 69134.92 country 2081 M677577406 4 0 0.00 0.00 0

In order to read the connection_time variable as numeric, we replace the "," with a ".". We will repeat this step during the data preprocessing.

In [6]:
df = df.assign(**{'connection_time': lambda df: df['connection_time'].str.replace(',', '.').astype(float)})
In [7]:
df.tail()
Out[7]:
step type amount gender device connection_time nameOrig race oldbalanceOrg age newbalanceOrig zone user_number nameDest user_connections security_alert oldbalanceDest newbalanceDest isFraud
1048570 95 CASH_OUT 132557.35 woman mac 0.420243 C1179511630 black 479803.00 39 347245.65 NaN 3179 C435674507 9 0 484329.37 616886.72 0
1048571 95 PAYMENT 9917.36 woman mac 0.376169 C1956161225 black 90545.00 36 80627.64 africa 2558 M668364942 9 0 0.00 0.00 0
1048572 95 PAYMENT 14140.05 unknow pc 0.837109 C2037964975 black 20545.00 31 6404.95 africa 1553 M1355182933 2 0 0.00 0.00 0
1048573 95 PAYMENT 10020.05 man mac 0.480931 C1633237354 black 90605.00 94 80584.95 country 642 M1964992463 8 0 0.00 0.00 0
1048574 95 PAYMENT 11450.03 man pc 0.868198 C1264356443 black 80584.95 84 69134.92 country 2081 M677577406 4 0 0.00 0.00 0

Dataset EDA¶

Observation of target variable isFraud¶

We take the target variable, which indicates whether a transaction is fraudulent.

In [8]:
plot_df = df['isFraud']\
        .value_counts(normalize = True)\
        .mul(100).rename('percent').reset_index()

plot_df_conteo = df['isFraud'].value_counts().reset_index()
plot_df_pc = pd.merge(plot_df, plot_df_conteo,
                      on = ['index'], how = 'inner')
plot_df_pc 
Out[8]:
index percent isFraud
0 0 99.89109 1047433
1 1 0.10891 1142
In [9]:
fig = px.histogram(plot_df_pc, x = "index", y = ['percent'])
fig.show()

We observe that the target variable is completely unbalanced since, as expected, only a small percentage of transactions are fraudulent. In this case, only 0.109%.
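Such an imbalance matters later, when the data is split for modelling: a plain random split could leave almost no fraud cases in one partition. A minimal sketch of a stratified split that preserves the fraud rate, on hypothetical toy data and assuming scikit-learn is available:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame mimicking the imbalance: 2 frauds in 200 transactions.
toy = pd.DataFrame({'amount': range(200), 'isFraud': [0] * 198 + [1] * 2})

# stratify keeps the fraud rate identical in both halves.
train, test = train_test_split(toy, test_size=0.5,
                               stratify=toy['isFraud'], random_state=0)
print(train['isFraud'].sum(), test['isFraud'].sum())  # one fraud case in each half
```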

Variable types observation¶

In [9]:
df.dtypes
Out[9]:
step                  int64
type                 object
amount              float64
gender               object
device               object
connection_time     float64
nameOrig             object
race                 object
oldbalanceOrg       float64
age                   int64
newbalanceOrig      float64
zone                 object
user_number           int64
nameDest             object
user_connections      int64
security_alert        int64
oldbalanceDest      float64
newbalanceDest      float64
isFraud               int64
dtype: object

Duplicates observation¶

In [10]:
df.duplicated().sum()
Out[10]:
0

There are no duplicates in the dataset

Null observation¶

In [11]:
df_null_columns = df.isnull().sum().sort_values(ascending=False)
df_null_rows = df.isnull().sum(axis=1).sort_values(ascending=False)

null_columns = pd.DataFrame(df_null_columns, columns=['missing_values'])
null_rows = pd.DataFrame(df_null_rows, columns=['null_rows'])
null_rows['target'] = df['isFraud'].copy()
null_columns['porcentaje'] = null_columns['missing_values']/df.shape[0]*100
null_rows['rows_percentaje']= null_rows['null_rows']/df.shape[1]

null_columns
Out[11]:
missing_values porcentaje
race 105163 10.029135
gender 105118 10.024843
device 104580 9.973536
zone 104414 9.957705
step 0 0.000000
newbalanceDest 0 0.000000
oldbalanceDest 0 0.000000
security_alert 0 0.000000
user_connections 0 0.000000
nameDest 0 0.000000
user_number 0 0.000000
age 0 0.000000
newbalanceOrig 0 0.000000
type 0 0.000000
oldbalanceOrg 0 0.000000
nameOrig 0 0.000000
connection_time 0 0.000000
amount 0 0.000000
isFraud 0 0.000000

We observe four variables with missing values. We will drop race from the dataset, since it is sensitive data. For the rest, we will treat missing values as one more category of each variable.
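The plan above can be sketched on a hypothetical miniature of the dataset (column names match the real one, values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with the same null pattern we plan to handle.
toy = pd.DataFrame({
    'race':   ['black', np.nan, 'white'],
    'gender': ['woman', 'man', np.nan],
    'zone':   [np.nan, 'africa', 'country'],
})

toy = toy.drop(columns=['race'])                        # drop the sensitive variable
toy = toy.fillna({'gender': 'Other', 'zone': 'Other'})  # nulls become a category
print(toy['gender'].tolist())  # ['woman', 'man', 'Other']
```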

Outliers study¶

For the outlier study, we first select the variables that are numeric (int or float) and then compute the range of each one, to understand which of them may contain outliers.

In [12]:
df_numeric = df[['step', 'amount', 'connection_time','oldbalanceOrg', 'age', 'newbalanceOrig',
                'user_number', 'user_connections', 'security_alert', 'oldbalanceDest','newbalanceDest']]
pd.DataFrame({'min': df_numeric.min(), 'max': df_numeric.max()}).T
Out[12]:
step amount connection_time oldbalanceOrg age newbalanceOrig user_number user_connections security_alert oldbalanceDest newbalanceDest
min 1.0 0.1 2.618800e-08 0.00 5.0 0.00 59.0 1.0 0.0 0.00 0.00
max 95.0 10000000.0 9.999991e-01 38939424.03 100.0 38946233.02 5000.0 10.0 1.0 42054659.73 42169156.09

We can foresee five variables with possible outliers: amount and the old/new balances of origin and destination.

In order to decide what counts as an outlier, we will use the get_deviation_of_mean_perc function with a multiplier of 2, flagging the values outside [Q1 − 2·IQR, Q3 + 2·IQR] for each of these variables.
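The helper lives in the auxiliary functions module; as a rough idea of the rule described above (a sketch, not the project's actual implementation), the flagging logic looks like:

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, multiplier: float = 2.0) -> pd.Series:
    """Flag values outside [Q1 - multiplier*IQR, Q3 + multiplier*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - multiplier * iqr) | (s > q3 + multiplier * iqr)

s = pd.Series([1, 2, 3, 4, 5, 100])
print(iqr_outlier_mask(s).sum())  # 1: only the extreme value 100 is flagged
```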

In [13]:
list_continuous_variables = ['amount', 'oldbalanceOrg', 'newbalanceOrig',
                              'oldbalanceDest', 'newbalanceDest']
In [14]:
from functions import get_deviation_of_mean_perc
In [15]:
get_deviation_of_mean_perc(df, list_continuous_variables, target='isFraud', multiplier=2)
Out[15]:
0.0 1.0 variable sum_outlier_values porcentaje_sum_null_values
0 0.986908 0.013092 amount 32081 0.030595
1 0.998938 0.001062 oldbalanceOrg 45203 0.043109
2 0.999934 0.000066 newbalanceOrig 45698 0.043581
3 0.999480 0.000520 oldbalanceDest 40402 0.038530
4 0.998491 0.001509 newbalanceDest 41751 0.039817

The only variable whose outlier values show a greater proportion of fraud than the dataset average is amount. Given its nature, we will keep this data, since it provides useful insight into when fraud is more likely.

In [16]:
plt.figure(figsize=(16, 8))
sns.boxplot( data = df, y = 'amount', x = 'isFraud', palette="icefire")
plt.title('amount vs Fraud',fontweight="bold", size=20)
Out[16]:
Text(0.5, 1.0, 'amount vs Fraud')

Variables correlation¶

For easier visualization, we will rearrange the variables in the dataset, placing the balances and amount at the end (but before the target variable isFraud). As stated before, in order to comply with the GDPR, we will not include the race variable.

In [17]:
df = df[['step','type', 'gender', 'device', 'connection_time','age', 'zone', 'user_number', 
         'user_connections', 'security_alert', 'nameOrig', 'nameDest', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 
         'oldbalanceDest', 'newbalanceDest', 'isFraud']]
df.head()
Out[17]:
step type gender device connection_time age zone user_number user_connections security_alert nameOrig nameDest amount oldbalanceOrg newbalanceOrig oldbalanceDest newbalanceDest isFraud
0 1 PAYMENT man mac 0.140039 85 capital 138 5 1 C1231006815 M1979787155 9839.64 170136.0 160296.36 0.0 0.0 0
1 1 PAYMENT woman mac 0.496890 57 country 909 1 0 C1666544295 M2044282225 1864.28 21249.0 19384.72 0.0 0.0 0
2 1 TRANSFER man pc 0.781150 66 capital 2569 10 0 C1305486145 C553264065 181.00 181.0 0.00 0.0 0.0 1
3 1 CASH_OUT man mac 0.565068 31 country 1787 3 0 C840083671 C38997010 181.00 181.0 0.00 21182.0 0.0 1
4 1 PAYMENT unknow mac 0.517114 90 country 3997 8 0 C2048537720 M1230701703 11668.14 41554.0 29885.86 0.0 0.0 0

Spearman¶

First, in order to examine the correlation between variables, we will plot a heatmap of the Spearman correlations between them. For that we call the get_corr_matrix function from the auxiliary functions notebook; it only considers numerical variables. Afterwards, we will compute Cramér's V for all categorical variables (whether their values are numeric or strings).

In [18]:
from functions import get_corr_matrix

get_corr_matrix(dataset = df, size_figure = [10,8])
In [19]:
df.corr(method = 'spearman')
Out[19]:
step connection_time age user_number user_connections security_alert amount oldbalanceOrg newbalanceOrig oldbalanceDest newbalanceDest isFraud
step 1.000000 0.000528 -0.000180 -0.000666 0.000129 -0.000862 -0.035927 -0.023625 -0.022146 0.007136 -0.001575 0.025968
connection_time 0.000528 1.000000 0.001739 -0.000451 0.000466 -0.519719 -0.000075 -0.000448 0.000088 -0.001316 -0.001251 0.000291
age -0.000180 0.001739 1.000000 0.001168 0.000635 0.000363 0.000168 0.000471 -0.000734 -0.000476 0.000243 0.000551
user_number -0.000666 -0.000451 0.001168 1.000000 -0.000555 -0.001605 0.000717 0.000491 -0.000147 0.000218 0.000799 -0.000821
user_connections 0.000129 0.000466 0.000635 -0.000555 1.000000 -0.000040 0.000679 -0.000732 -0.001120 0.000716 0.001114 0.000764
security_alert -0.000862 -0.519719 0.000363 -0.001605 -0.000040 1.000000 0.000543 0.000216 0.000236 0.001188 0.001610 0.000265
amount -0.035927 -0.000075 0.000168 0.000717 0.000679 0.000543 1.000000 0.030319 -0.092714 0.603029 0.671973 0.028488
oldbalanceOrg -0.023625 -0.000448 0.000471 0.000491 -0.000732 0.000216 0.030319 1.000000 0.814880 0.009227 -0.026446 0.030924
newbalanceOrig -0.022146 0.000088 -0.000734 -0.000147 -0.001120 0.000236 -0.092714 0.814880 1.000000 0.021501 -0.112781 -0.027402
oldbalanceDest 0.007136 -0.001316 -0.000476 0.000218 0.000716 0.001188 0.603029 0.009227 0.021501 1.000000 0.924923 -0.015825
newbalanceDest -0.001575 -0.001251 0.000243 0.000799 0.001114 0.001610 0.671973 -0.026446 -0.112781 0.924923 1.000000 -0.005359
isFraud 0.025968 0.000291 0.000551 -0.000821 0.000764 0.000265 0.028488 0.030924 -0.027402 -0.015825 -0.005359 1.000000

Naturally, we see a very strong correlation between the old and new origin balances, as well as between the old and new destination balances. To a lesser, but still significant, extent, both destination balance variables correlate with amount, also as expected. It is remarkable, however, that the correlations of oldbalanceOrg and newbalanceOrig with isFraud are practically zero (0.031 and −0.027, respectively).

On the other hand, we do find a significant inverse correlation (negative value) between security_alert and connection_time, which is unexpected. A priori, one would think that transactions with longer connection times would be the ones that trigger a security alert (value 1), but the negative correlation indicates the opposite.

Cramer's V¶

First we define the categorical variables (excluding the target variable)

In [20]:
df_categorical_variables = df[["type", "gender", "device", "zone",
                               "security_alert"]]
df_categorical_variables.head()
Out[20]:
type gender device zone security_alert
0 PAYMENT man mac capital 1
1 PAYMENT woman mac country 0
2 TRANSFER man pc capital 0
3 CASH_OUT man mac country 0
4 PAYMENT unknow mac country 0

Then, we call the cramers_v function from the auxiliary functions notebook and plot its heatmap.
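As a reference for what we assume cramers_v computes, a minimal version can be built on scipy's chi-squared test (this sketch skips any bias correction the project's helper may apply):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v_sketch(x: pd.Series, y: pd.Series) -> float:
    """Cramer's V from the chi-squared statistic of the contingency table."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

a = pd.Series(['pc', 'mac'] * 50)
print(cramers_v_sketch(a, a))  # 1.0: a variable perfectly associates with itself
```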

In [21]:
from functions import cramers_v

rows = []
for var1 in df_categorical_variables:
  col = []
  for var2 in df_categorical_variables:
    cramers = cramers_v(df_categorical_variables[var1], df_categorical_variables[var2]) # Cramer's V test
    col.append(round(cramers, 2)) # keep the rounded Cramer's V value
  rows.append(col)

cramers_results = np.array(rows)
df_vcramer = pd.DataFrame(cramers_results, columns = df_categorical_variables.columns,
                          index = df_categorical_variables.columns)

sns.heatmap(df_vcramer, vmin=0, vmax=1,center = 0, square = True, 
            linewidths = .5, cmap = 'coolwarm')
Out[21]:
<AxesSubplot:>
In [22]:
df_vcramer
Out[22]:
type gender device zone security_alert
type 1.0 0.0 0.00 0.0 0.00
gender 0.0 1.0 0.00 0.0 0.00
device 0.0 0.0 1.00 0.0 0.57
zone 0.0 0.0 0.00 1.0 0.00
security_alert 0.0 0.0 0.57 0.0 1.00

There is only one significant association, between device and security_alert.

Pearson¶

Finally, we compute the Pearson correlation between the continuous numerical variables, in case it provides additional information beyond the Spearman correlation. For this, as before, we build a new dataframe with only the continuous numerical variables.

In [23]:
df_continous_vairables = df[['step', 'connection_time', 'age', 'user_number',
                             'user_connections', 'amount', 'oldbalanceOrg',
                             'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']]
df_continous_vairables.head()
Out[23]:
step connection_time age user_number user_connections amount oldbalanceOrg newbalanceOrig oldbalanceDest newbalanceDest
0 1 0.140039 85 138 5 9839.64 170136.0 160296.36 0.0 0.0
1 1 0.496890 57 909 1 1864.28 21249.0 19384.72 0.0 0.0
2 1 0.781150 66 2569 10 181.00 181.0 0.00 0.0 0.0
3 1 0.565068 31 1787 3 181.00 181.0 0.00 21182.0 0.0
4 1 0.517114 90 3997 8 11668.14 41554.0 29885.86 0.0 0.0

As with the Spearman matrix, we call get_corr_matrix from the auxiliary functions notebook, but this time we set the "metodo" (method) argument to pearson.

In [24]:
get_corr_matrix(dataset = df_continous_vairables,
                metodo = 'pearson', size_figure = [10,8])
In [25]:
df_continous_vairables.corr(method = 'pearson')
Out[25]:
step connection_time age user_number user_connections amount oldbalanceOrg newbalanceOrig oldbalanceDest newbalanceDest
step 1.000000 0.001068 0.000201 -0.000356 -0.000140 -0.025996 -0.006782 -0.007182 -0.002251 -0.019502
connection_time 0.001068 1.000000 0.001739 -0.000450 0.000466 0.001755 -0.001279 -0.001342 -0.000199 -0.000029
age 0.000201 0.001739 1.000000 0.001168 0.000634 0.000076 0.000155 0.000119 0.000441 0.000377
user_number -0.000356 -0.000450 0.001168 1.000000 -0.000554 0.001476 0.000379 0.000351 -0.001038 -0.000588
user_connections -0.000140 0.000466 0.000634 -0.000554 1.000000 0.000579 -0.001799 -0.001844 0.000902 0.001160
amount -0.025996 0.001755 0.000076 0.001476 0.000579 1.000000 0.004864 -0.001132 0.215562 0.311942
oldbalanceOrg -0.006782 -0.001279 0.000155 0.000379 -0.001799 0.004864 1.000000 0.999050 0.093305 0.064050
newbalanceOrig -0.007182 -0.001342 0.000119 0.000351 -0.001844 -0.001132 0.999050 1.000000 0.095179 0.063722
oldbalanceDest -0.002251 -0.000199 0.000441 -0.001038 0.000902 0.215562 0.093305 0.095179 1.000000 0.978401
newbalanceDest -0.019502 -0.000029 0.000377 -0.000588 0.001160 0.311942 0.064050 0.063722 0.978401 1.000000

We obtain similar results to the Spearman correlation: only amount and the balances are correlated.

Plotting variables vs the target variable¶

Since some variables contain NAs, and, as stated at the beginning, these null values will be treated as one more category of each variable, for the purpose of plotting they will be replaced with "Other".

In [26]:
na_values =  {'gender': 'Other', 'device': 'Other', 'zone': 'Other'}
df = df.fillna(value=na_values)

Variable: Step¶

In [27]:
df[df["isFraud"] == 0].hist(column="step", color="darkolivegreen", bins=95)
plt.xlabel("Step (unit = 1 hour)")
plt.ylabel("Number of Transactions")
plt.title("Number of Valid Transactions throughout time")

df[df["isFraud"] == 1].hist(column="step", color="darkred", bins=95)
plt.xlabel("Step (unit = 1 hour)")
plt.ylabel("Number of Transactions")
plt.title("Number of Fraudulent Transactions throughout time")

plt.tight_layout()
plt.show()

The first thing to note is that, as one could foresee, the scales are completely different: the maximum in the valid transactions exceeds 50,000 transactions per hour, while for fraudulent ones it is below 25. Despite this, the small number of valid transactions around hour 50 and hour 90 (if any) is notable, while fraudulent transactions maintain similar levels during those periods.
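This observation can be made concrete by computing the fraud rate per step; a sketch on hypothetical toy data with one quiet hour:

```python
import pandas as pd

# Toy transactions: hour 1 is busy, hour 50 is quiet, both have one fraud.
toy = pd.DataFrame({
    'step':    [1] * 100 + [50] * 5,
    'isFraud': [0] * 99 + [1] + [0] * 4 + [1],
})

# Fraud rate per step: share of fraudulent transactions at each hour.
rate = toy.groupby('step')['isFraud'].mean()
print(rate.to_dict())  # {1: 0.01, 50: 0.2}: the quiet hour has a much higher rate
```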

Variable: connection_time¶

In [28]:
df[df["isFraud"] == 0].hist(column="connection_time", color="darkolivegreen", bins=50)
plt.xlabel("Connection time (from 0 to 1 second)")
plt.ylabel("Number of Transactions")
plt.title("Valid transactions depending on connection time")

df[df["isFraud"] == 1].hist(column="connection_time", color="darkred", bins=50)
plt.xlabel("Connection time (from 0 to 1 second)")
plt.ylabel("Number of Transactions")
plt.title("Fraudulent transactions depending on connection time")

plt.tight_layout()
plt.show()

Again, bearing the scale in mind, connection time is in fact less insightful than step. Nothing notable.

Variable: Amount¶

In [29]:
df[df["isFraud"] == 0].hist(column="amount", color="darkolivegreen", bins=50)
plt.xlabel("Transaction amount")
plt.ylabel("Number of Transactions")
plt.title("Valid transactions depending on amount")

df[df["isFraud"] == 1].hist(column="amount", color="darkred", bins=50)
plt.xlabel("Transaction amount")
plt.ylabel("Number of Transactions")
plt.title("Fraudulent transactions depending on amount")

plt.tight_layout()
plt.show()

Here we can confirm what was stated in the outlier study: the proportion of fraudulent transactions increases as the amount grows.
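A quick way to quantify this is the fraud rate per amount band; a sketch on hypothetical toy data:

```python
import pandas as pd

# Toy transactions: small amounts are valid, large amounts include fraud.
toy = pd.DataFrame({
    'amount':  [100, 200, 300, 900_000, 950_000, 990_000],
    'isFraud': [0,   0,   0,   1,       0,       1],
})

# Fraud rate by amount band, using two illustrative bins.
bands = pd.cut(toy['amount'], bins=[0, 10_000, 1_000_000],
               labels=['low', 'high'])
rate = toy.groupby(bands, observed=True)['isFraud'].mean()
print(rate['low'], rate['high'])  # 0.0 vs roughly 0.67 in this toy sample
```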

Variable: user_number¶

In [30]:
df[df["isFraud"] == 0].hist(column="user_number", color="darkolivegreen", bins=50)
plt.xlabel("Number of concurrent users during the transaction")
plt.ylabel("Number of Transactions")
plt.title("Valid transactions depending on concurrent users")

df[df["isFraud"] == 1].hist(column="user_number", color="darkred", bins=50)
plt.xlabel("Number of concurrent users during the transaction")
plt.ylabel("Number of Transactions")
plt.title("Fraudulent transactions depending on concurrent users")

plt.tight_layout()
plt.show()

Bearing the scale in mind, the number of concurrent users is, like connection time, less insightful than step. Nothing notable.

Variable: user_connections¶

In [31]:
df[df["isFraud"] == 0].hist(column="user_connections", color="darkolivegreen")
plt.xlabel("Number of connections of the user during the day")
plt.ylabel("Number of Transactions")
plt.title("Valid transactions depending on the user's number of connections")

df[df["isFraud"] == 1].hist(column="user_connections", color="darkred")
plt.xlabel("Number of connections of the user during the day")
plt.ylabel("Number of Transactions")
plt.title("Fraudulent transactions depending on the user's number of connections")

plt.tight_layout()
plt.show()

A priori, nothing to highlight.

Variable: type¶

In [32]:
df[df["isFraud"] == 0].type.value_counts().plot(kind='bar')
plt.ylabel("Number of Transactions")
plt.title("Valid transactions depending on transaction type")

plt.show()
In [33]:
df[df["isFraud"] == 1].type.value_counts().plot(kind='bar')
plt.ylabel("Number of Transactions")
plt.title("Fraudulent transactions depending on transaction type")
plt.show()

We can observe that fraud is committed only in CASH_OUT and TRANSFER transactions. It is also noticeable that the proportion of fraudulent transactions is higher in TRANSFER than in CASH_OUT.
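The same conclusion can be read off a row-normalized crosstab; a sketch on hypothetical toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    'type':    ['PAYMENT', 'PAYMENT', 'TRANSFER', 'TRANSFER',
                'CASH_OUT', 'CASH_OUT'],
    'isFraud': [0, 0, 0, 1, 0, 1],
})

# Fraud share within each transaction type (rows sum to 1).
shares = pd.crosstab(toy['type'], toy['isFraud'], normalize='index')
print(shares[1].to_dict())  # PAYMENT shows a 0.0 fraud share in this toy sample
```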

EDA Final conclusions¶

The first thing to notice is that the dataset is completely unbalanced, with non-fraud cases making up 99.89% of the transactions.

Since one of the variables is race, we will drop it and not analyze it, in order to comply with the GDPR.

Also, there are some outliers, especially in amount, but those values are almost 10 times more likely to be fraud than the dataset average, hence we will keep them.

Note that we have not yet analyzed the nameOrig and nameDest variables, since we foresee transforming them. Both are composed of two different parts: a letter (C or M) and the client's ID number. This transformation will be performed in the next notebook.
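The intended split can be sketched with pandas string accessors (illustrative values taken from the head of the dataset):

```python
import pandas as pd

s = pd.Series(['C1231006815', 'M1979787155', 'C553264065'])

# Split the account code into its letter prefix and numeric ID.
prefix = s.str[0]
account_id = s.str[1:]
print(prefix.tolist())  # ['C', 'M', 'C']
```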

There are some expected correlations among the balances, as well as between amount and the destination balances. There are also associations between device type and security_alert, and a negative correlation between connection_time and security_alert. Since none of these variables represent the same thing, we will not drop any of them from our future models based on their correlation.

Highly noticeable is the step variable and its behaviour with respect to isFraud: we will create a new variable representing the number of transactions per step, as we have seen that whenever there are fewer transactions the probability of fraud skyrockets.

Regarding transaction type, we have learned that only Transfers and Cash-outs can be fraudulent.